Power Programmierung

home *** CD-ROM | disk | FTP | other *** search

/ Power Programmierung / Power-Programmierung (Tewi)(1994).iso / magazine / drdobbs / 1990 / 11 / e_floyd / applicat.txt < prev next >

Wrap

Text File | 1990-07-24 | 8KB | 152 lines

Applications Spelling Checker: With superimposed coding, we can build a spelling checker that has some unusual properties. First, use the results of one of the studies of word usage frequency to build a "primary" dictionary bit table from the 10000 most frequently used English words. At 13 bits per key, the optimal bit table will occupy about 23k bytes and will have a false drop probability of about 1/8000. Then, partition the remaining words by first letter into lists of 10000 or fewer words. Create a "secondary" bit table for each partition and save it on disk. The spelling checker program can keep the primary bit table in memory and can allocate an auxiliary, 23k buffer for one of the secondary tables. The program will find most words in the primary bit table. For rare or misspelled words, the program will usually have to read a bit table from disk into the auxiliary buffer. Still, it should be fast enough to keep up with real-time typing. Overall, the checker will fail to catch about 1/4000 misspelled words, the same as Doug McIlroy's program. Now we have a complete, conventional spelling checker with good accuracy, but small enough (under 60k) to reside constantly in memory, perhaps as a DOS TSR. If we allocate a primary bit table for 12000 words (28k), the program will be capable of learning up to 2000 new words without compromising its design target for accuracy. Document retrieval: To help researchers locate information, electronic libraries index documents and abstracts by author and subject keyword search terms. Sometimes as many as ten search terms are associated with a single document. Instead of a conventional index, we can use superimposed coding to speed up the search. Build for each document a small bit table superimposing five bits for each index term. This bit table is sometimes called a "surrogate" or "signature"; I'll call it a surrogate. The surrogate may be quite small, say 128 bits. The false drop rate for a 128-bit surrogate with ten, 5-bit keywords is less than (50/128)@SUPERSCRIPT[5], or 1/110. Let's say we want to locate all the documents associated with the keyword "programmer". We could test each surrogate for the keyword as described above. However, there's no need to recompute the hash function and bit positions for the search keyword each time. Instead, build a 128-bit search table, like a surrogate but with only the 5 bits "on" for the single word "programmer". If a document surrogate has even one bit "off" that is "on" in the search table, the document can't possibly be associated with the keyword "programmer". We can compare the search table with each surrogate using Boolean logic: Perform a logical "AND" between each document's surrogate and the search table and compare the result to the original search table. If they are not equal, we can reject the document immediately. If they are equal, we must investigate the documentèfurther to determine whether or not it is associated with the search term. 1001011000101.. Surrogate for document 1 0001001001000.. Search table ───────────── AND 0001001000000.. Reject document 1 ^ 0010010111010.. Surrogate for document 2 0001001001000.. Search table ───────────── AND 0000000001000.. Reject document 2 ^ ^ 0101011011010.. Surrogate for document 3 0001001001000.. Search table ───────────── AND 0001001001000.. Investigate document 3 If we want to search for a conjunction of two terms, say "programmer" AND "sociopath", the same technique applies. Build a search table with bits for both terms. With each surrogate, perform the logical "AND" and compare. This time, all ten bits in the search table must also be "on" in the document table. The false drop probability is now about (1/110)*(1/110), or 1/12100. So, the search becomes more efficient as we search for more conjoined terms. To search for a disjunction of two terms, say "programmer" OR "hacker", we must build two search tables and investigate further any document whose surrogate satisfies either search. The false drop probability, therefore, is half the probability of a single-term search, or about 1/55. By extending these techniques, we can construct complex queries. One shortcoming - we can't use surrogates to speed up negation (NOT) in a query. Fortunately, in document retrieval, negation is frequently associated with positive conditions that restrict the set of documents anyway. Finally, we can save search time by employing a peculiar "sideways" ordering of surrogate bits. Gather together all the first bits from a group of surrogates and store them in document order at the beginning of the surrogate file. Follow this with all the second bits, and so on. Think of the file as columns of surrogate bits. Now we can search the group of document surrogates in parallel. Simply "AND" together the bit columns corresponding to the "on" bits in the search table and the result is a bit map whose "on" bits correspond to documents that we must investigate further. If we block surrogates in groups of 4096 documents, each bit column will occupy 512 bytes - one sector on many modern disk subsystems. This kind of grouping simplifies calculations and optimizes hardware performance for bit column retrieval. In the example fragment shown below, for instance, it is necessary to read only the fourth, seventh, and tenth sectors of the surrogate file to compute a bit map for the first 4k documents. è 0 0 0 1 0 0 1 0 0 1 0 0 0... Search table Sur | AND | AND | = Bit Map --- V V V ------- 1. 1 0 0 1 0 1 1 0 0 0 1 0 1... 0 Reject 2. 0 0 1 0 0 1 0 1 1 1 0 1 0... 0 Reject 3. 0 1 0 1 0 1 1 0 1 1 0 1 0... 1 Investigate 4. 0 1 1 0 0 0 1 1 0 0 0 0 1... 0 Reject 5. 1 0 0 1 0 0 1 0 1 1 0 0 1... 1 Investigate . . . . . . . . . . Database: We can use surrogate files instead of conventional indices in database applications, also. For example, consider a million-record database with fifty fields in each record. The average size of each field is 10 bytes, so the database size is about 500 megabytes. We want to index the database on all fifty fields. As an optimistic estimate, a conventional index would require about 15 megabytes for each field for a minimum total index size of 750 megabytes. The index is larger than the database. On the other hand, an optimal surrogate for 50 keys at 10 bits per key is about 90 bytes for a total surrogate file size of 90 megabytes. The surrogate file is almost an order of magnitude smaller than a conventional index. It is likely to be slower than a conventional index for single-field queries. However, for complex queries involving Boolean combinations of field values, it can be considerably faster than an conventional index. (Implementation note: Distinguish between fields by prefixing each field value with a domain identifier before hashing bits for the surrogate.) When the database is fairly stable and a large proportion of queries against the index result in a record-not-found condition, a superimposed code dictionary can save time. For example, consider a 10000 record database indexed on one field. A superimposed code dictionary for 10000 keys with 10 bits per key is about 18k bytes - small enough to hold in memory. Before searching the index for a key, check for its existence in the dictionary. If it's not in the dictionary, there's no need to search the index. We will have reduced unnecessary index searches by a factor of a thousand.